The idea of exploiting both task and data parallelism in programs is appealing. However, identifying
realistic, yet manageable example programs that can benefit from such a mix of task and data parallelism
is a major problem for researchers. We address this problem by describing a suite of five applications
from the domains of scientific, signal, and image processing that are of reasonable size, are
representative of real codes, and can benefit from exploiting task and data parallelism. The suite
includes fast Fourier transforms, narrowband tracking radar, multibaseline stereo imaging, and airshed
simulation. Complete source code for each example program is available from the authors.
1.  Introduction


There is growing interest in the idea of exploiting both task and data parallelism [1, 3, 4, 5, 6, 7, 19].
There are a number of practical reasons for this interest. For many applications, especially in the
domains of image and signal processing, data set sizes are limited by physical constraints and cannot be
easily increased [19]. In such cases the amount of available data parallelism is limited. For example, in
the multibaseline stereo application described later in this report, the size of an image is determined
by the circuitry of the video cameras and the throughput of the camera interface. Increasing the image
size means buying new cameras and building a faster interface, which may not be feasible. Since the data
parallelism is limited, additional parallelism must come from tasking.

Another reason for the increased interest in task parallelism is that simulations are becoming
increasingly sophisticated as they attempt to capture interactions among different physical phenomena.
The phenomena might represent different scientific disciplines, and different parts of the simulation
might even be written by different groups. For example, the airshed model described later in this report
characterizes the formation of air pollution as the interaction between the wind blowing and reactions
among various chemical species. It is natural to model such interactions using tasking, where one task
models the wind blowing, and the other task models the chemical reactions. Further, if the codes are
written by different groups, task parallelism may be the only feasible way to integrate the codes.

Applications that can benefit from a mix of task and data parallelism tend to be somewhat complicated
because they are typically composed of a collection of nontrivial functions, each of which is a sequence
of data parallel operations. Identifying and building representative example programs that are a
manageable size is a major stumbling block for computer science researchers who are not application
domain experts. We address this problem by describing a set of realistic example programs from the
domains of scientific, signal, and image processing that can benefit from a mix of task and data
parallelism:

1.  1D fast Fourier transform.

2.  2D fast Fourier transform.

3.  Narrowband tracking radar.

4.  Multibaseline stereo.

5.  Airshed simulation.


We  identified  these  applications  in  the  course  of  developing  an  integrated  model  of  task  and  data  parallelism 
for the Fx parallelizing Fortran compiler [9, 19, 17, 18, 23] and have found them to be extremely helpful.

Complete Fortran 77 sources of the programs are available from the authors. Each program is fewer than
500 lines of code. The Fortran 77 sources are useful for a number of reasons. First, the sources provide
an unambiguous specification of each application, including input and output data sets. Second, there
are many models and dialects for task parallelism. Fortran 77 represents a lowest common denominator of
sorts, available to everyone, for describing the applications. Finally, the source code clearly
identifies the obvious sources of coarse-grained parallelism in the form of calls to subroutines. This
enabled us, in all but one case, to port the Fortran 77 programs to the Fx system with only minor
modifications (the only exception being the airshed simulation, which requires a dynamic model of task
parallelism that Fx does not currently support). We invite other researchers who are interested in task
parallelism to use these Fortran 77 sources as the basis for writing the applications in their favorite
task parallel dialect.

Section 2 briefly outlines a simple space of the different ways that task and data parallelism can be
used in the same program. Sections 3-6 describe the example programs and discuss briefly how they can be
mapped onto a parallel system using a mix of task and data parallelism. Section 7 describes where to find
the online Fortran 77 codes and provides some more detail on their structure.
2.  Combinations of task and data parallelism


Each  program  in  our  task  parallel  suite  has  the  following  form: 

      do i = 1,m
         call T1()
         call T2()
         call T3()
      enddo


There is an outer loop that iterates over m input datasets. The body of the loop consists of calls to
task-subroutines. Each task-subroutine typically consists of a sequence of data-parallel statements
(e.g., array assignment statements or DOALL-like parallel loops) that can run on multiple processors.
Depending on the application, there may or may not be dependences across iterations of the outer loop and
among the task-subroutines. We can graphically represent the program using a task graph, as shown in
Figure 1. Each node corresponds to a task-subroutine, and each arc corresponds to a data dependence
between a pair of task-subroutines.


T1  T2  T3 

Figure 1: Example task graph


We can view different task and data parallel mappings for an application as points in a two-dimensional
space [18]. The first axis corresponds to different clusterings of task-subroutines, where each
task-subroutine in a cluster runs on the same set of processors, and each cluster runs on a unique set of
processors. In some cases, assigning two task-subroutines to the same cluster reduces communication
overhead, at the cost of reduced parallelism. In these cases, there is a tension between reducing
communication cost and increasing parallelism.

If the input data sets for a cluster are independent, then that cluster can be replicated. For example,
one replicated instance of a cluster could process even numbered data sets and the other replicated
instance could process the odd numbered data sets. The degree of replication is captured by the second
axis. Replication is beneficial for task-subroutines that do not scale well.
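
The even/odd example of replication can be illustrated with a minimal sketch. It is not taken from the suite's Fortran 77 sources: the function name `assign_datasets` and the 0-based data set numbering are assumptions made only for illustration.

```python
# Illustrative sketch: round-robin assignment of input data sets to the
# replicated instances of one cluster. Names and 0-based numbering are
# hypothetical, not part of the suite.

def assign_datasets(num_datasets, replicas):
    """Return one list of data set indices per replicated instance."""
    assignment = [[] for _ in range(replicas)]
    for i in range(num_datasets):
        # Data set i goes to instance i mod replicas.
        assignment[i % replicas].append(i)
    return assignment

# With two replicas, one instance receives the even numbered data sets
# and the other receives the odd numbered ones.
for instance, datasets in enumerate(assign_datasets(8, 2)):
    print(instance, datasets)
```

Such an assignment is only valid when the data sets are independent, which is the condition stated above for replication.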

Some examples of different combinations of task and data parallelism are shown in Figure 2. Figure 2(a)
shows the usual data parallel mapping where all task-subroutines are clustered onto all processors, with
no replication. Figure 2(b) shows a task parallel mapping where each task-subroutine gets its own
cluster, again with no replication. It may be desirable in some cases to replicate the data-parallel
clusters, as in Figure 2(c). Finally, a mix of task and data parallelism, using both clustering and
replication, is shown in Figure 2(d).


3.  Fast Fourier transform


The fast Fourier transform (FFT) converts an input data set from the temporal/spatial domain to the
frequency domain, and vice versa. While it is an important algorithm in its own right, with numerous
applications in scientific, signal, and image processing, it is an especially interesting example program
for studying task parallelism because it exemplifies a common computational pattern, i.e., manipulate the
data row-wise, then manipulate the data column-wise. This pattern appears in diverse applications such as
medical imaging [13], synthetic aperture radar imaging [15], ADI solvers, sonar beamforming, and the
radar and airshed programs described later in this document.

Figure 2: Combinations of task and data parallel mappings: (a) data parallel mapping, (b) task parallel
mapping, (c) replicated data parallel mapping, (d) replicated data and task parallel mapping

More precisely, a discrete Fourier transform (DFT) is a complex matrix-vector product

$$y = F_n x$$

where $x$ and $y$ are complex vectors, and $F_n = (f_{pq})$ is an $n \times n$ matrix such that
$f_{pq} = \omega_n^{pq}$ for $0 \le p, q < n$, where

$$\omega_n = \cos(2\pi/n) - i \sin(2\pi/n) = e^{-2\pi i/n},$$

and $i = \sqrt{-1}$. A fast Fourier transform (FFT) is an efficient procedure for computing the DFT that
exploits structure in $F_n$ to compute the DFT matrix-vector product in $O(n \log n)$ time. Higher
dimensional FFTs are also defined. In general, an $m$-dimensional FFT operates on an $m$-dimensional
matrix with $n$ elements in each dimension in $O(n^m \log n)$ time. See [20] for an excellent description
of the numerous FFT algorithms that have been developed over the past 10 years.
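
As a point of reference for the definition above, the matrix-vector product can be evaluated directly in $O(n^2)$ operations. The following is a minimal sketch only; the helper names `dft_matrix` and `dft` are illustrative and do not come from the suite.

```python
import cmath

def dft_matrix(n):
    """Build F_n with entries omega_n**(p*q), omega_n = exp(-2*pi*i/n)."""
    omega = cmath.exp(-2j * cmath.pi / n)
    return [[omega ** (p * q) for q in range(n)] for p in range(n)]

def dft(x):
    """Compute y = F_n x directly, in O(n^2) operations."""
    n = len(x)
    F = dft_matrix(n)
    return [sum(F[p][q] * x[q] for q in range(n)) for p in range(n)]

# A unit impulse transforms to a vector of ones (up to rounding).
y = dft([1, 0, 0, 0])
```

An FFT computes the same vector $y$ but avoids forming $F_n$ explicitly.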

3.1.  1D fast Fourier transform

If $n = n_1 n_2$, then a 1D FFT can be computed using a collection of smaller independent 1D FFTs
[2, 8, 20]. Starting with a view of the input vector $x$ as an $n_1 \times n_2$ array $A$, stored in
column major order, perform $n_1$ independent $n_2$-point FFTs on the rows of $A$, scale each element
$a_{pq}$ of the resulting array by a factor of $\omega_n^{pq}$, and then perform $n_2$ independent
$n_1$-point FFTs on the columns of $A$. The final result vector $y$ is obtained by concatenating the rows
of the resulting array. Figure 3 shows the task graph for the 1D FFT.

Figure 3: 1D FFT task graph for one input vector

Input and output are sequences of vectors reshaped as 2D arrays. Nodes labeled trans perform a transpose
operation, nodes labeled col FFTs perform a set of 1D FFTs on the columns of their input array, and the
node labeled scale multiplies each element of its input array by a constant. To exploit locality in the
memory subsystem, the program implements each set of row-wise operations as a transpose followed by a set
of column-wise operations. This specific order is an artifact of the fact that Fortran 77 stores matrices
in column-major order. If the example program were written in C, each column-wise operation would be
implemented as a transpose followed by a set of row-wise operations.
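
The decomposition of a 1D transform into row transforms, a scaling step, and column transforms can be checked numerically. The sketch below is an assumption-laden illustration rather than the suite's code: a direct `dft` stands in for each small FFT, indices are 0-based, the scale factor is taken to be $\omega_n^{pq}$ with $\omega_n = e^{-2\pi i/n}$, and the function name `fft_1d_by_rows_and_columns` is hypothetical.

```python
import cmath

def dft(v):
    """Direct DFT of a list; stands in for a small FFT routine."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_1d_by_rows_and_columns(x, n1, n2):
    n = n1 * n2
    assert len(x) == n
    # View x as an n1 x n2 array A stored in column-major order.
    A = [[x[p + n1 * q] for q in range(n2)] for p in range(n1)]
    # n1 independent n2-point transforms on the rows of A.
    A = [dft(row) for row in A]
    # Scale element (p, q) by omega_n**(p*q).
    A = [[A[p][q] * cmath.exp(-2j * cmath.pi * p * q / n) for q in range(n2)]
         for p in range(n1)]
    # n2 independent n1-point transforms on the columns of A.
    cols = [dft([A[p][q] for p in range(n1)]) for q in range(n2)]
    # Concatenate the rows of the resulting array.
    return [cols[q][p] for p in range(n1) for q in range(n2)]

x = [complex(j, -j) for j in range(12)]
y = fft_1d_by_rows_and_columns(x, 3, 4)   # agrees with dft(x) up to rounding
```

The row transforms, the scaling, and the column transforms are each independent across rows, elements, and columns respectively, which is what makes them natural data parallel steps.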

Mapping the 1D FFT onto a parallel system is easy in some ways and challenging in other ways. The problem
is easy in the sense that the column-wise FFTs and the scaling operation are perfectly parallel and
easily represented by conventional data parallel constructs. The column-wise FFTs are naturally expressed
with a DOALL-like parallel loop construct such as the HPF independent DO statement [10]. The scaling
operation can be expressed with a Fortran 90 array assignment statement. However, the problem is
challenging because the transpose operation requires an efficient redistribution of data, usually in the
form of a complete exchange where each processor must send data to every other processor. Since the input
arrays are independent, both replication and clustering of the task graph are possible.
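
To see why a distributed transpose leads to a complete exchange, consider the following sequential simulation. It is a sketch under stated assumptions: the matrix is square, each of the P simulated processors owns a contiguous block of rows, the dimension is divisible by P, and the function name `complete_exchange_transpose` is hypothetical. The suite's programs may distribute data differently.

```python
def complete_exchange_transpose(local, P):
    """Simulate a block-distributed transpose of an n x n matrix.

    local[p] is the list of rows owned by simulated processor p.
    Returns (new_local, messages): new_local[p] holds the rows of the
    transposed matrix owned by p, and messages counts the tiles sent
    between distinct processors.
    """
    b = len(local[0])                      # rows owned by each processor
    inbox = [[None] * P for _ in range(P)]
    messages = 0
    # Phase 1: each processor cuts its slab into P tiles of b x b and
    # sends tile number dst to processor dst.
    for src in range(P):
        for dst in range(P):
            tile = [row[dst * b:(dst + 1) * b] for row in local[src]]
            inbox[dst][src] = tile
            if src != dst:
                messages += 1
    # Phase 2: each processor transposes the tiles it received and lays
    # them side by side to form its rows of the transposed matrix.
    new_local = []
    for dst in range(P):
        rows = []
        for i in range(b):
            row = []
            for src in range(P):
                tile = inbox[dst][src]
                row.extend(tile[j][i] for j in range(b))
            rows.append(row)
        new_local.append(rows)
    return new_local, messages

A = [[4 * r + c for c in range(4)] for r in range(4)]
parts, sent = complete_exchange_transpose([A[0:2], A[2:4]], 2)
```

In this simulation every processor sends one tile to every other processor, so the message count is P(P-1), which is the communication pattern the text calls a complete exchange.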

3.2.  2D fast Fourier transform

Computing the 2D FFT is similar to computing the 1D FFT. Given an $n_1 \times n_2$ input array $A$,
perform $n_2$ independent $n_1$-point FFTs on the columns of $A$, followed by $n_1$ independent
$n_2$-point 1D FFTs on the rows of $A$. Figure 4 shows the task graph for the 2D FFT. As with the 1D FFT,
row-wise FFTs are replaced by a transpose followed by a set of column-wise FFTs. Notice that the 2D FFT is
simpler than the 1D FFT. No scaling is required and there is one fewer transpose operation. Again, the
input and output are sequences of arrays, and since these arrays are independent, both replication and
clustering of the task graph are possible.
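
The sequence of column transforms and transposes described for the 2D case can be sketched as follows. This is an illustration only: a direct `dft` stands in for each 1D FFT, arrays are Python lists of rows, and the names `col_ffts`, `transpose`, and `fft_2d` are hypothetical. The final transpose, which restores the original orientation, is an assumption about the task graph.

```python
import cmath

def dft(v):
    """Direct DFT of a list; stands in for a 1D FFT routine."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def col_ffts(A):
    """Transform every column of A independently."""
    rows, cols = len(A), len(A[0])
    out = [[0j] * cols for _ in range(rows)]
    for q in range(cols):
        col = dft([A[p][q] for p in range(rows)])
        for p in range(rows):
            out[p][q] = col[p]
    return out

def fft_2d(A):
    B = col_ffts(A)       # n2 independent n1-point transforms
    B = transpose(B)      # rows of A become columns
    B = col_ffts(B)       # n1 independent n2-point transforms
    return transpose(B)   # restore the original orientation

Y = fft_2d([[1, 2, 3], [4, 5, 6]])
```

Because each call to `col_ffts` touches only one column at a time, the only step that moves data between columns is the transpose.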


input --> col FFTs --> trans --> col FFTs --> trans --> output


Figure 4: 2D FFT task graph for one input array


4.  Narrowband tracking radar


The narrowband tracking radar benchmark was developed by researchers at MIT Lincoln Laboratories to
measure the effectiveness of various multicomputers for their radar applications [16]. It is an
interesting program for studying task parallelism because of its hard real-time requirements, and because
the size of the input data set is limited by physical properties of the radar sensor. The task graph for
the radar application is shown in Figure 5.

Figure 5: Radar task graph for one input array

The program inputs data from a single sensor along $c = 4$ independent channels. Every 5 milliseconds,
for each channel, the program receives $d = 512$ complex vectors of length $r = 10$, one after the other
in the form of an $r \times d$ complex array $A$ (assuming the column major ordering of Fortran). At a
high level, each input array $A$ is processed in the following way: (1) Corner turn the $r \times d$
input array to form a $d \times r$ array. (2) Perform $r$ independent $d$-point FFTs. (3) Convert the
resulting complex $d \times r$ array to a real $w \times r$ subarray, $w = 40$, by replacing each element
$a + ib$ in the $w \times r$ subarray with its scaled magnitude $\sqrt{a^2 + b^2}/d$. (4) Threshold each
element of the subarray using a cutoff that is a function of that element and the sum of the subarray
elements. Elements that are above the threshold are set to unity; elements below the threshold are set to
zero.
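
The four processing steps can be sketched for a single input array as follows. This is a hedged illustration, not the benchmark code: a direct `dft` stands in for the FFT, the subarray is assumed to be the leading w rows, and the cutoff rule (a multiple `factor` of the mean subarray element) is purely hypothetical, since the text only states that the cutoff depends on the element and on the sum of the subarray elements. The demonstration sizes are tiny and are not the benchmark's sizes.

```python
import cmath

def dft(v):
    """Direct DFT of a list; stands in for an FFT routine."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def corner_turn(A):
    """Step 1: transpose an r x d array (list of rows) into a d x r array."""
    return [list(col) for col in zip(*A)]

def process_input_array(A, w, factor=2.0):
    r, d = len(A), len(A[0])
    B = corner_turn(A)
    # Step 2: r independent d-point transforms, one per column of B.
    cols = [dft([B[j][k] for j in range(d)]) for k in range(r)]
    # Step 3: scaled magnitudes of the (assumed) leading w x r subarray.
    M = [[abs(cols[k][j]) / d for k in range(r)] for j in range(w)]
    # Step 4: hypothetical cutoff based on the sum of the subarray elements.
    total = sum(sum(row) for row in M)
    cutoff = factor * total / (w * r)
    return [[1 if v > cutoff else 0 for v in row] for row in M]

# Tiny demonstration: r = 2 vectors' worth of rows, d = 8 samples each.
d = 8
A = [[1.0] * d,                                            # constant signal
     [cmath.exp(2j * cmath.pi * 2 * j / d) for j in range(d)]]  # tone in bin 2
hits = process_input_array(A, w=4)
```

In this demonstration the constant signal is detected in bin 0 of the first column and the tone in bin 2 of the second column; all other entries fall below the assumed cutoff.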

The corner turn operation is equivalent to a transpose, so it can potentially induce a complete exchange
where each processor communicates with every other processor. As with the 1D FFT, the column FFTs,
scaling, and thresholding operations can be naturally expressed using conventional data parallel
constructs. Further, the reduction operation requires an efficient reduction mechanism. However, the most
interesting computational property of the radar benchmark is the fact that the size parameters $r$, $d$,
$c$, and $w$ are determined by mother nature and the properties of current sensor technology. The luxury
of simply increasing the data set size does not exist in this case. The amount of available low-level
data parallelism is limited, so additional parallelism must come from higher-level task parallelism. Like
the FFT examples, input data sets are independent, so both replication and clustering of the task graph
are possible.


5.  Multibaseline stereo


The multibaseline stereo uses an algorithm developed at Carnegie Mellon that gives greater accuracy in
depth through the use of more than two cameras [11]. It is an interesting program for studying task
parallelism because it contains significant amounts of both inter-task and intra-task communication [22],
and because, like the radar example, the size of the input data sets cannot be easily increased. Our
implementation is adapted from a previous data-parallel implementation written in a specialized image
processing language [21].

Figure 6 shows the task graph for the stereo program. Input consists of three $m \times n$ images
acquired from three horizontally aligned, equally spaced cameras. One image is the reference image, the
other two are match images. For each of 16 disparities, $d = 0, \ldots, 15$, the first match image is
shifted by $d$ pixels, the second image is shifted by $2d$ pixels. A difference image is formed by
computing the sum of squared differences between the corresponding pixels of the reference image and the
shifted match images. Next, an error image is formed by replacing each pixel in the difference image with
the sum of the pixels in a surrounding 13 x 13 window. A disparity image is then formed by finding, for
each pixel, the disparity that minimizes error. Finally, the depth of each pixel is displayed as a simple
function of its disparity.
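
The difference, error, and disparity images can be sketched as below. This is a minimal illustration and not the suite's implementation: the shift direction (left), the zero padding of shifted images, the clipping of windows at image borders, and all function names are assumptions. The window size and the number of disparities are parameters so that small test images can be used.

```python
def shifted(img, k):
    """Shift each row k pixels to the left, padding with zeros (assumed)."""
    n = len(img[0])
    return [[row[x + k] if x + k < n else 0 for x in range(n)] for row in img]

def difference_image(ref, m1, m2, d):
    """Sum of squared differences against match images shifted by d and 2d."""
    s1, s2 = shifted(m1, d), shifted(m2, 2 * d)
    return [[(r - a) ** 2 + (r - b) ** 2 for r, a, b in zip(rr, ra, rb)]
            for rr, ra, rb in zip(ref, s1, s2)]

def error_image(diff, win):
    """Sum each pixel's surrounding win x win window, clipped at the borders."""
    m, n, h = len(diff), len(diff[0]), win // 2
    return [[sum(diff[yy][xx]
                 for yy in range(max(0, y - h), min(m, y + h + 1))
                 for xx in range(max(0, x - h), min(n, x + h + 1)))
             for x in range(n)] for y in range(m)]

def disparity_image(ref, m1, m2, disparities=16, win=13):
    """For each pixel, pick the disparity with the smallest error."""
    errors = [error_image(difference_image(ref, m1, m2, d), win)
              for d in range(disparities)]
    m, n = len(ref), len(ref[0])
    return [[min(range(disparities), key=lambda d: errors[d][y][x])
             for x in range(n)] for y in range(m)]

# One-row example: a bright pixel displaced by 1 and 2 pixels in the match images.
ref = [[0, 0, 0, 9, 0, 0, 0, 0]]
m1 = [[0, 0, 0, 0, 9, 0, 0, 0]]
m2 = [[0, 0, 0, 0, 0, 9, 0, 0]]
disp = disparity_image(ref, m1, m2, disparities=3, win=3)
```

The error images for different disparities are independent of each other, while the final minimization combines all of them, which matches the broadcast and reduction structure discussed for this program.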

The stereo program requires efficient mechanisms for broadcasting and reducing large data sets. The
computation of the difference images requires simple pointwise operations on the three input images and
can thus be naturally expressed with Fortran 90 array statements. The computation of the error images is
somewhat more interesting, being similar to a convolution operation. The convolution can be modeled as a
DOALL where the loop iterations operate on overlapping regions of the image, which means that processors
must communicate before the loop iterations can begin executing. As with the FFT and radar programs, the
data sets are independent, so both replication and clustering of the task graph are possible.
Figure 6: Multibaseline stereo task graph for one input data set


6.  Airshed  simulation 


The airshed simulation is significantly more complex than the previous examples. The multiscale airshed
model captures the formation, reaction, and transport of atmospheric pollutants and related chemical
species [11, 12]. It is an interesting application because it requires a dynamic task parallel model, and
because different parts of the application exhibit widely varying amounts of DOALL parallelism.

The airshed application simulates the behavior of the airshed model when it is applied to $s$ chemical
species, distributed over domains containing $p$ grid points in each of $l$ atmospheric layers. Typical
values are $s = 35$ species, $500 < p < 5000$ grid points, and $l = 5$ atmospheric layers. Because of the
multiscale grid, the entire northeastern United States can be modeled with problems in this size range. A
total of about 200 chemical reactions are modeled.

The program computes in two principal phases: (1) horizontal transport (using a finite element method
with repeated application of a direct solver), followed by (2) chemistry/vertical transport (using an
iterative, predictor-corrector method). Figure 7 depicts the task graph for one hour of simulated time.
Input is an $l \times s \times p$ concentration array. Initial conditions are input from disk
(inputhour), and in a preprocessing phase for the horizontal transport phases to follow, the finite
element stiffness matrix for each layer is assembled and factored (pretrans). The atmospheric conditions
captured by the stiffness matrix are assumed to be constant during the simulated hour, so this step is
performed just once per hour. This is followed by a sequence of steps (the number of steps is one of the
initial conditions), where each step consists of a horizontal transport phase, followed by a
chemistry/vertical transport phase, followed by another horizontal transport phase. Each horizontal
transport phase performs $l \times s$ backsolves, one for each layer and species. All may be computed
independently; however, for each layer $l$, all backsolves use the same factored matrix $A_l$. The
chemistry/vertical transport phase performs an independent computation for each of the $p$ grid points.
Output for the hour is an updated concentration array, which is then input to the next hour.

A number of interesting issues arise when we map the airshed to a parallel system. In the other example
programs we have discussed, the number of tasks is known at compile time. However, in the airshed
program, the number of transport/chemistry/transport steps for each hour is not known until runtime,
which implies a dynamic model of task parallelism. Also, since the output concentration array of one hour
is the input to the next hour, replication of the task graph is not feasible, as it was with the previous
example programs.

Another issue is that the preprocessing phase, the transport phase, and the chemistry phase have very
different levels of obvious DOALL parallelism because the sizes of the different dimensions of the
concentration array differ by orders of magnitude. For example, the preprocessing phase independently
computes stiffness matrices for each layer; unfortunately there are only 5 layers, so the obvious DOALL
approach will use 5 processors. To get better utilization, we must parallelize the computation for each
layer, or we must try to employ task parallelism to pipeline the computation for each layer, or both.

The issues involved in mapping the transport phase are particularly interesting. Since there are $l = 5$
layers and $s = 35$ species, the transport phase could be easily implemented with doubly nested DOALLs
that consist of 175 independent loop iterations. For moderate sized parallel systems, with say 64
processors, this approach might work well. However, for larger systems, with say 512 processors, this
approach uses only a fraction of the processors. As with the preprocessing phase, we can get better
utilization by either parallelizing the sparse finite element computation (a difficult task) or trying to
use task parallelism to pipeline the computation.
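
The utilization argument for the transport phase can be made concrete with a small calculation. The sketch below assumes that all loop iterations cost the same and are dealt out as evenly as possible, which the text does not state; the function name `doall_utilization` is hypothetical.

```python
import math

def doall_utilization(iterations, processors):
    """Fraction of processor time spent on useful work, assuming equal-cost
    iterations distributed as evenly as possible."""
    rounds = math.ceil(iterations / processors)
    return iterations / (rounds * processors)

layers, species = 5, 35
iterations = layers * species              # 175 independent backsolves
for p in (64, 512):
    print(p, round(doall_utilization(iterations, p), 3))
```

Under these assumptions, 64 processors are kept about 91% busy (175 of 192 slots), while 512 processors are only about 34% busy (175 of 512), which illustrates why the DOALL-only approach stops scaling.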

The final issue stems from the fact that the preprocessing, horizontal transport, and chemistry/vertical
transport phases each operate on different dimensions of the concentration array. To exploit locality in
the memory hierarchy, an implementation will most likely insert the appropriate transpose operation
before each phase. On a parallel system, this can induce a complete exchange where each processor
communicates with every other processor. Again, as with the FFT and radar examples, we see the need for
an efficient complete exchange mechanism.


7.  Distribution 

Complete Fortran 77 sources for the applications described in this report are available via anonymous
FTP from warp.cs.cmu.edu in file fx-codes/tpsuite/tpsuite.tar. World Wide Web clients like Mosaic and
Cello can use the following URL:

ftp://warp.cs.cmu.edu/usr/anon/fx-codes/tpsuite/tpsuite.tar

Each source program is a vanilla Fortran 77 code that operates on a sequence of inputs and produces a
sequence of outputs. There are no data files; all input datasets are produced automatically by the
program. Each program has fewer than 500 lines of code and consists of a single source and include file.
The include file contains size constants that can be changed if the researcher wants to measure
scalability. Each program (except for radar) checks its output automatically. The radar code prints a few
lines of output which can be easily verified by the user; directions are provided in the source code.
Sample outputs of the programs running on a DEC 3000 Alpha workstation are also provided. Each program
(except the airshed) computes physically meaningful results. To keep the program at a manageable size, we
have provided a version of the airshed simulation that uses a synthetic workload for the innermost loops.
8.  Concluding remarks


We have described a suite of realistic programs that can benefit from a mix of task and data parallelism.
Researchers in the areas of mapping, parallelizing compilers, and parallel programming environments are
invited to use these programs to test and validate their ideas for exploiting task and data parallelism
in applications.